The Distant Reader Meets COVID-19

Below is a list of tasks/functions we could implement as part of a
proposal. Many of these tasks/functions may be implemented
concurrently, and they are only loosely prioritized:

 1. Dedicate a single Reader node to serving a full text index of
    virus-related materials, initially drawn from the CORD-19
    journal literature dataset. This machine would do very nicely
    with only two or four cores. (A sketch of such an index is
    appended below.)

 2. Dedicate a few nodes to harvesting and indexing the data for
    Item #1. Indexing is computationally heavy.

 3. If Items #1 and #2 are successful, then index additional, but
    more difficult to acquire, journal literature, such as content
    from JSTOR or Zotero libraries.

 4. Identify the likes of Team JAMS (the good students who won the
    PEARC hack-a-thon), and have these people use the Reader's
    output (ngrams, parts-of-speech, grammars, named-entities,
    etc.) as the input for things like discovering relationships
    between drugs, "correlations of language" between articles, or
    visualizations of the underlying data such as timelines,
    geographic maps, or network diagrams.

 5. Modify the Reader's code to use a biomedical language model
    instead of the existing English language model. (A sketch is
    appended below.)

 6. Modify the Reader's code so the feature-extraction tasks are
    more distinct from the report-generation tasks; that way we
    can divide and conquer when it comes to report generation.

 7. Work on a subset of the CORD dataset, and get the subset
    working end-to-end before scaling up.

 8. If Items #5, #6, and #7 are successful, then increase the
    carrel's content to include all of the CORD content.

 9. If Item #8 is successful, then include the content from
    Item #3.

10. Implement a better, more interactive topic modeling interface.
    Just as everybody likes to search, topic modeling is very
    popular in text studies. (A sketch is appended below.)

11. Integrate the full text indexing and the topic modeling
    interface into the Reader's study carrel, thus creating a
    coherent whole. For example: search the index, create a
    subset, and topic model it. For example: peruse a study
    carrel, identify a thing of interest, link it to the full text
    index or topic model, and return the thing of interest in the
    context of the original article. Etc.

What might be needed to do this work? Some of it might include:

 1. A re-allocation of existing cores, and thus some systems
    administration.

 2. A re-examination of the shared file system, because my
    anecdotal observations suggest a lot of time is spent on disk
    I/O.

 3. Hacker(s) who can read delimited files over the 'Net (or a
    relational database file), parse them, ask questions of them,
    and visualize the results. (A sketch is appended below.)

 4. Content experts who can evaluate the output of everything
    above.

 5. Time.

Here is a list of nice-to-have items -- people:

 1. Someone who really knows Solr -- the full text indexer of
    choice.

 2. Someone who really knows relational databases -- because a
    whole lot of the data is ultimately stored in one.

 3. Someone who can write interactive Web pages to... interact
    with the underlying data in real time.

 4. People to create additional study carrels of related content,
    and then we can meld the resulting carrels together.

The above is only a set of suggestions. I hope they give you
ideas, and I hope you are available to chat on Friday at 9
o'clock. In any event, additional suggestions and reactions are
welcome.

--
Eric Morgan and Team Reader
March 23, 2020
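P.S. For the sake of concreteness, a few hedged sketches follow.
They are illustrations under stated assumptions, not working
implementations. First, the full text index of Item #1. The
sketch assumes a running Solr core named "cord19" and uses pysolr,
one common Python client for Solr; the core name and the field
names are hypothetical:

    # a minimal sketch, assuming a running Solr core named "cord19";
    # the field names (id, title, text) are hypothetical
    import pysolr

    solr = pysolr.Solr('http://localhost:8983/solr/cord19',
                       always_commit=True)

    # index a couple of CORD-19-like records
    solr.add([
        {'id': 'cord-0001', 'title': 'Viral replication',
         'text': 'Chloroquine inhibits viral replication...'},
        {'id': 'cord-0002', 'title': 'Drug trials',
         'text': 'A randomized trial of antiviral drugs...'},
    ])

    # query the full text index
    for hit in solr.search('text:replication'):
        print(hit['id'], hit['title'])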
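Second, the biomedical language model of Item #5. Assuming the
Reader's pipeline is spaCy-based, swapping the model may be as
simple as loading a different one; en_core_sci_sm is scispaCy's
small biomedical model:

    # a minimal sketch, assuming the Reader's NLP pipeline is
    # spaCy-based; en_core_sci_sm is scispaCy's small biomedical
    # model, installed separately from spaCy itself
    import spacy

    # nlp = spacy.load('en_core_web_sm')  # the existing English model
    nlp = spacy.load('en_core_sci_sm')    # a biomedical model instead

    # extract named entities from a biomedical sentence
    document = nlp('Chloroquine inhibits SARS-CoV-2 replication '
                   'in Vero E6 cells.')
    for entity in document.ents:
        print(entity.text, entity.label_)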
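Third, the topic modeling of Item #10. An interactive interface
might wrap something like the following scikit-learn sketch; the
documents here are mere placeholders for a carrel's plain text:

    # a minimal sketch of the kind of modeling an interactive
    # interface might wrap; the documents are placeholders
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    documents = [
        'coronavirus infection and viral replication',
        'drug treatments and clinical trial outcomes',
        'viral replication in cell cultures',
        'clinical outcomes of drug therapies',
    ]

    # count the words, and then model the topics
    vectorizer = CountVectorizer(stop_words='english')
    matrix = vectorizer.fit_transform(documents)
    model = LatentDirichletAllocation(n_components=2, random_state=0)
    model.fit(matrix)

    # list the most salient words per topic
    words = vectorizer.get_feature_names_out()
    for index, topic in enumerate(model.components_):
        salient = [words[i] for i in topic.argsort()[-5:]]
        print('topic #%s: %s' % (index, '; '.join(salient)))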
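Finally, the hacking of Needed Item #3 -- reading a delimited file
over the 'Net and asking questions of it. The URL and the column
names below are hypothetical; a carrel's feature files are assumed
to be tab-delimited:

    # a minimal sketch; the URL and column names are hypothetical
    import pandas as pd

    URL = 'https://example.org/carrel/ent/entities.tsv'

    # read a delimited file over the 'Net
    entities = pd.read_csv(URL, sep='\t')

    # ask a question of it: what are the most frequent named
    # entities, by type?
    frequencies = entities.groupby(['type', 'entity']).size()
    print(frequencies.nlargest(25))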